Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Although a promising path toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference, and it also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance on average with only 16% of the parameters of 15 task-finetuned models, showcasing the performance reliability of the multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
Deep convolutional neural networks have proven their effectiveness and are acknowledged as the dominant method for image classification. However, a severe drawback of deep convolutional neural networks is their poor explainability. Unfortunately, in many real-world applications, users need to understand the rationale behind the predictions of deep convolutional neural networks when deciding whether to trust those predictions. To resolve this issue, a novel genetic algorithm-based method is proposed, for the first time, to automatically evolve local explanations that can help users assess the rationality of the predictions. Furthermore, the proposed method is model-agnostic, i.e., it can be utilised to explain any deep convolutional neural network model. In the experiments, ResNet is used as the example model to be explained, and the ImageNet dataset is selected as the benchmark dataset. DenseNet and MobileNet are also explained to demonstrate the model-agnostic characteristic of the proposed method. The evolved local explanations on four images randomly selected from ImageNet are presented, showing that the evolved local explanations are straightforward for humans to recognise. Moreover, the evolved explanations explain the predictions of deep convolutional neural networks on all four images very well, successfully capturing meaningful interpretable features of the sample images. Further analysis of the 30 runs of the experiments shows that the evolved local explanations can also improve the probabilities/confidences of the deep convolutional neural network models in making their predictions. The proposed method can obtain local explanations within one minute, which is more than ten times faster than LIME, the state-of-the-art method.
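The abstract above does not give implementation details, but the core idea it names (a genetic algorithm evolving a local explanation for a single prediction) can be illustrated with a minimal sketch. Everything here is assumed rather than taken from the paper: the image is reduced to `N_SEGMENTS` superpixels, `model_confidence` is a toy stand-in for the black-box classifier, and the fitness, selection, and mutation settings are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SEGMENTS = 20   # number of superpixels in the image (hypothetical)

def model_confidence(mask):
    """Stand-in for a CNN's class probability given only the superpixels
    kept by `mask`; segments 3, 7 and 12 are, by construction, the ones
    that actually matter for the predicted class."""
    important = np.zeros(N_SEGMENTS, dtype=bool)
    important[[3, 7, 12]] = True
    hit = np.logical_and(mask, important).sum() / important.sum()
    clutter = 0.05 * mask.sum() / N_SEGMENTS  # irrelevant regions dilute confidence
    return hit - clutter

def fitness(mask):
    # reward model confidence, lightly penalise large explanations
    return model_confidence(mask) - 0.01 * mask.sum()

def evolve(pop_size=30, generations=40, p_mut=0.05):
    pop = rng.random((pop_size, N_SEGMENTS)) < 0.5
    best, best_f = pop[0], -np.inf
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        if scores.max() > best_f:
            best, best_f = pop[scores.argmax()].copy(), scores.max()
        # binary tournament selection
        a, b = rng.integers(0, pop_size, (2, pop_size))
        winners = np.where((scores[a] > scores[b])[:, None], pop[a], pop[b])
        # uniform crossover with a shifted copy of the winners
        cross = rng.random((pop_size, N_SEGMENTS)) < 0.5
        children = np.where(cross, winners, np.roll(winners, 1, axis=0))
        # bit-flip mutation
        pop = np.logical_xor(children, rng.random((pop_size, N_SEGMENTS)) < p_mut)
    return best

best = evolve()
print("explanation segments:", np.flatnonzero(best).tolist())
```

The surviving mask is the local explanation: the subset of superpixels that keeps the classifier's confidence high while staying small enough for a human to inspect.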
Generalized Category Discovery (GCD) aims to recognize both known and novel categories in a set of unlabeled data, given another dataset labeled with only the known categories. Without considering the differences between known and novel categories, current methods learn about them in a coupled manner, which can hurt the model's generalization and discriminative ability. Furthermore, coupled training prevents these models from explicitly transferring category-specific knowledge from the labeled data to the unlabeled data, which can lose high-level semantic information and impair model performance. To mitigate the above limitations, we present a novel model called Decoupled Prototypical Network (DPN). By formulating a bipartite matching problem over category prototypes, DPN not only decouples known and novel categories to achieve different training targets effectively, but also aligns the known categories in the labeled and unlabeled data to transfer category-specific knowledge explicitly and capture high-level semantics. Furthermore, DPN learns more discriminative features for both known and novel categories through our proposed Semantic-aware Prototypical Learning (SPL). Besides capturing meaningful semantic information, SPL also alleviates the noise of hard pseudo-labels through semantic-weighted soft assignment. Extensive experiments show that DPN outperforms state-of-the-art models by a large margin on all evaluation metrics across multiple benchmark datasets. Code and data are available at https://github.com/Lackel/DPN.
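The bipartite matching over category prototypes mentioned above can be sketched with the Hungarian algorithm. This is not DPN's actual formulation: the prototype vectors, the cosine cost, and the four-known/two-novel split are all fabricated for illustration; only the idea of aligning labeled-data prototypes with a subset of unlabeled-data prototypes and treating the unmatched remainder as novel comes from the abstract.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# hypothetical prototypes: 4 known-category prototypes from labeled data,
# 6 cluster prototypes from unlabeled data (the 4 known + 2 novel categories)
labeled = l2_normalize(rng.normal(size=(4, 16)))
unlabeled = l2_normalize(np.vstack([
    labeled + 0.1 * rng.normal(size=(4, 16)),  # noisy copies of the known ones
    rng.normal(size=(2, 16)),                  # two unrelated novel prototypes
]))

# cost = 1 - cosine similarity; Hungarian matching aligns each labeled
# prototype with its closest unlabeled counterpart
cost = 1.0 - labeled @ unlabeled.T            # shape (4, 6)
row, col = linear_sum_assignment(cost)

# matched clusters are treated as known categories, the rest as novel
novel = sorted(set(range(6)) - set(col))
print("matches:", list(zip(row.tolist(), col.tolist())), "novel:", novel)
```

Because there are more unlabeled prototypes than labeled ones, the assignment naturally leaves some clusters unmatched, which is what lets the known and novel categories be trained with different targets.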
We propose an end-to-end lecture video generation system that can produce realistic and complete lecture videos directly from annotated slides, the instructor's reference voice, and the instructor's reference portrait video. Our system consists mainly of a speech synthesis module with few-shot speaker adaptation and a talking-head generation module based on adversarial learning. It not only reduces the instructor's workload but can also change the language and accent, which helps students follow the lecture more easily and allows the lecture content to reach a wider audience. Our experimental results show that the proposed model outperforms other current approaches in terms of realism, naturalness, and accuracy. A video demo showing how our system works, along with the evaluation and comparison results, is available at: https://youtu.be/cy6tyki0cog.
This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. The challenge includes two tracks. Track 1 targets the super-resolution of compressed images, and Track 2 targets the super-resolution of compressed videos. In Track 1, we use the popular dataset DIV2K as the training, validation, and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, consisting of the LDV 2.0 dataset (335 videos) plus 30 additional videos. In this challenge, 12 teams and 2 teams submitted final results for Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state of the art of super-resolution on compressed images and videos. The proposed LDV 3.0 dataset is available at https://github.com/renyang-home/ldv_dataset. The homepage of this challenge is at https://github.com/renyang-home/aim22_compresssr.
Dynamic facial expression recognition (FER) databases provide important data support for affective computing and its applications. However, most FER databases are annotated with a few basic, mutually exclusive categories and contain only one modality, e.g., video. Such monotonous labels and modalities cannot accurately mimic human emotions or support real-world applications. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotion category and several sentences describing the subject's affective behavior in the clip. For compound emotion annotation, each clip is categorized into one or more of 11 widely used emotions, namely anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high-quality labels, we filter out unreliable annotations via an expectation-maximization (EM) algorithm, obtaining 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database with compound emotion annotations and emotion-related captions. In addition, we propose a novel Transformer-based expression snippet feature learning method that recognizes compound emotions by exploiting the expression-change relationships among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni-modal and multi-modal FER. Our MAFW database is publicly available at https://mafw-database.github.io/mafw.
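The EM-based filtering of unreliable annotations mentioned above is not specified in the abstract; the following is a generic sketch of one standard formulation (a one-coin annotator model, where EM alternately infers a soft true label per clip and an agreement rate per annotator). The annotator count, class count, and simulated reliabilities are all hypothetical.

```python
import numpy as np

# hypothetical setup: 5 annotators label 200 clips with one of 11 emotions;
# annotators 0-3 are reliable, annotator 4 answers almost at random
rng = np.random.default_rng(2)
n_clips, n_classes, n_annot = 200, 11, 5
truth = rng.integers(0, n_classes, n_clips)
acc_true = np.array([0.9, 0.9, 0.85, 0.8, 0.15])
labels = np.where(rng.random((n_clips, n_annot)) < acc_true,
                  truth[:, None],
                  rng.integers(0, n_classes, (n_clips, n_annot)))

reliab = np.full(n_annot, 0.8)  # initial guess of each annotator's reliability
for _ in range(20):
    # E-step: posterior over the true label of each clip, given reliabilities
    log_post = np.zeros((n_clips, n_classes))
    for k in range(n_classes):
        agree = labels == k
        p = np.where(agree, reliab, (1 - reliab) / (n_classes - 1))
        log_post[:, k] = np.log(p).sum(axis=1)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: each annotator's expected agreement with the inferred labels
    agree_prob = post[np.arange(n_clips)[:, None], labels]  # (n_clips, n_annot)
    reliab = agree_prob.mean(axis=0)

keep = reliab > 0.5  # drop annotations from unreliable annotators
print("estimated reliabilities:", np.round(reliab, 2), "keep:", keep)
```

In a real pipeline the filtered annotations would then be aggregated into the single- and multi-label emotion categories described above.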
In this paper, we focus on how to learn additional feature representations for few-shot learning (FSL) through pretext tasks (e.g., rotation or color permutation). The additional knowledge produced by pretext tasks can further improve FSL performance because it differs from the human-annotated supervision (i.e., the class labels of the FSL task). To address this problem, we propose a plug-in Hierarchical Tree Structure-aware (HTS) method that not only learns the relationship between the FSL and pretext tasks, but, more importantly, can adaptively select and aggregate the feature representations generated by the pretext tasks to maximize the performance of the FSL task. A hierarchical tree constructing component and a gated selection aggregating component are introduced to construct the tree structure and discover richer transferable knowledge that can rapidly adapt to novel classes with only a few labeled images. Extensive experiments show that our HTS can significantly enhance multiple few-shot methods to achieve new state-of-the-art performance on four benchmark datasets. The code is available at: https://github.com/remimz/hts-eccv22.
Molecular and morphological characters are both important parts of biological taxonomy; they can be contradictory but need to be integrated. Image recognition of organisms and bioinformatics are nowadays emerging and hot topics, yet a gap remains between them. In this work, a multi-branch recognition framework mediated by genetic information bridges this barrier, establishing a link between the macroscopic morphology and the micro-molecular information of mushrooms. A novel multi-perspective structure is proposed to fuse the feature images of the three branch models, significantly improving recognition accuracy by about 10%, to over 90%. Furthermore, genetic information is incorporated into the mushroom image recognition task by using genetic-distance embeddings as the representation space for predicting image distances and identifying species. The semantic overfitting of traditional classification tasks and the granularity of fine-grained image recognition are also discussed in depth for the first time. The generality of the model is studied in the fine-grained scenario using zero-shot learning tasks, which can predict the taxonomic and evolutionary information of unseen samples. We propose the first method to map images to DNA, namely using an encoder to map images to genetic distances and then decoding the DNA through a pre-trained decoder, where the total test accuracy of DNA prediction for 37 species is 87.45%. By systematically investigating the mushroom image recognition problem and bridging the gap between macroscopic biological information and micro-molecular information, this study creates a novel recognition framework that will provide a new reference for future intelligent biometrics.
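One concrete way to obtain the kind of genetic-distance representation space described above is classical multidimensional scaling (MDS): species coordinates are derived from a pairwise genetic-distance matrix so that Euclidean distances between coordinates approximate the genetic distances, and an image encoder could then be trained to regress those coordinates. This is a generic sketch, not the paper's pipeline; the 4-species distance matrix is invented for illustration.

```python
import numpy as np

# hypothetical symmetric genetic-distance matrix for 4 mushroom species
# (species 0/1 and 2/3 form two closely related pairs)
D = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [1.0, 0.9, 0.3, 0.0]])

# classical MDS: turn distances into coordinates whose pairwise
# Euclidean distances approximate the genetic distances
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]               # largest eigenvalues first
coords = vecs[:, order[:2]] * np.sqrt(np.maximum(vals[order[:2]], 0.0))

# sanity check: distances between the recovered 2-D coordinates
recon = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.round(recon, 2))
```

Regressing an image onto `coords` then makes image-space distances reflect genetic distances, which is the property the zero-shot prediction of unseen species relies on.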
Species identification of macrofungi, i.e., mushrooms, has always been a challenging task. There are still a large number of poisonous mushrooms, which pose a risk to people's lives. However, traditional identification methods require many experts with taxonomic knowledge for manual identification, which is not only inefficient but also consumes substantial labor and capital costs. In this paper, we propose a new attention-based model, MushroomNet, which uses the lightweight network MobileNetV3 as the backbone, combined with our proposed attention structure, and achieves excellent performance on the mushroom recognition task. The test accuracy of the MushroomNet model reaches 83.9% on a public dataset and 77.4% on a local dataset. The proposed mixed-channel attention mechanism focuses attention well on the body of the mushroom in the image, as shown by attention heat maps visualized with Grad-CAM. Furthermore, in this study, genetic distance is incorporated into the mushroom image recognition task: genetic distance is used as the representation space, and the genetic distance between each pair of mushroom species in the dataset is used as the embedding in that space to predict image distances and confirm species. We found that the MES activation function can predict the genetic distance of mushrooms well, but with lower accuracy than softmax. MushroomNet has demonstrated great potential for automatic and online mushroom image identification, and the proposed automatic procedure will assist and serve as a reference for traditional mushroom classification.
Registering point clouds of forest environments is an essential prerequisite for LiDAR applications in precision forestry. State-of-the-art forest point cloud registration methods require the extraction of individual tree attributes, and they face an efficiency bottleneck when processing real-world forest point clouds with dense trees. We propose an automatic, robust, and efficient method for registering forest point clouds. Our method first locates tree stems in the raw point clouds and then matches the stems based on their relative spatial relationships to determine the registration transformation. Compared with existing methods, our algorithm requires no extra individual tree attributes and has linear complexity in the number of trees in the environment, allowing it to align point clouds of large forest environments. Extensive experiments show that our method outperforms state-of-the-art methods in registration accuracy and robustness, and significantly outperforms existing techniques in efficiency. Additionally, we introduce a new benchmark dataset that complements the very few existing open datasets for the development and evaluation of registration methods for forest point clouds.
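Once stems have been matched between the two scans, the registration transformation can be estimated in closed form from the matched stem positions, e.g. with the standard Kabsch least-squares procedure. This sketch covers only that final step, with correspondences assumed already known; the stem positions and the ground-truth motion are synthetic, and the stem-matching step described in the abstract is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical stem positions (x, y) located in the source scan
src = rng.uniform(0, 50, size=(8, 2))

# synthetic ground-truth rigid motion between the two scans
theta = np.deg2rad(25)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([12.0, -4.0])
dst = src @ R_true.T + t_true

def rigid_from_correspondences(a, b):
    """Least-squares rotation/translation mapping points a onto b (Kabsch)."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    H = (a - ca).T @ (b - cb)               # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    return R, cb - R @ ca

R, t = rigid_from_correspondences(src, dst)
residual = np.linalg.norm(src @ R.T + t - dst, axis=1).max()
print("max residual after registration:", float(residual))
```

With noise-free correspondences the recovered transform matches the ground truth to numerical precision; with real stem detections the same estimator would be wrapped in an outlier-robust loop.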